A comparison of sub-word indexing methods for information retrieval

نویسنده

  • Johannes Leveling
چکیده

This paper compares different methods of subword indexing and their performance on the English and German domain-specific document collection of the Cross-language Evaluation Forum (CLEF). Four major methods to index sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams, and 4) Knuth’s algorithm for hyphenation. The performance and effects of sub-word extraction on search time and index size and time are reported for English and German retrieval experiments. The main results are: For English, indexing sub-words does not outperform the baseline using standard retrieval on stemmed word forms (–8% mean average precision (MAP), – 11% geometric MAP (GMAP), +1% relevant and retrieved documents (rel ret) for the best experiment). For German, with the exception of n-grams, all methods for indexing sub-words achieve a higher performance than the stemming baseline. The best performing sub-word indexing methods are to use consonant-vowelconsonant sequences and index them together with word stems (+17% MAP, +37% GMAP, +14% rel ret compared to the baseline), or to index syllable-like sub-words obtained from the hyphenation algorithm together with stems (+9% MAP, +23% GMAP, +11% rel ret).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Content Based Radiographic Images Indexing and Retrieval Using Pattern Orientation Histogram

Introduction: Content Based Image Retrieval (CBIR) is a method of image searching and retrieval in a  database. In medical applications, CBIR is a tool used by physicians to compare the previous and current  medical images associated with patients pathological conditions. As the volume of pictorial information  stored in medical image databases is in progress, efficient image indexing and retri...

متن کامل

Word and Sub-word Indexing Approaches for Reducing the Effects of OOV Queries on Spoken Audio

We explore the problem of out of vocabulary (OOV) queries in audio indexing systems by comparing three indexing methods on a broadcast news repository containing 75 hours of audio. Our systems are word-based, phoneme-based and a novel system based on syllable-like units called particles. To better examine the performance of these three approaches we use a query set where the percentage of OOVs ...

متن کامل

بررسی تأثیرات ریشه‌یابی در بازیابی اطلاعات در زبان فارسی

Using the language-specific behavior in information retrieval systems can improve the quality of the retrieved results significantly. Part of the word that remains after removing its affixes is called stem. Stemming process can be used for improving the relevancy of the results in information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...

متن کامل

Fast decoding for open vocabulary spoken term detection

Information retrieval and spoken-term detection from audio such as broadcast news, telephone conversations, conference calls, and meetings are of great interest to the academic, government, and business communities. Motivated by the requirement for high-quality indexes, this study explores the effect of using both word and sub-word information to find in-vocabulary and OOV query terms. It also ...

متن کامل

Evaluation of Stop Word Lists in Chinese Language

In modern information retrieval systems, effective indexing can be achieved by removal of stop words. Till now many stop word lists have been developed for English language. However, no standard stop word list has been constructed for Chinese language yet. With the fast development of information retrieval in Chinese language, exploring the evaluation of Chinese stop word lists becomes critical...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009